State the necessary assumptions for a multiple linear regression model to be valid in terms of conducting hypothesis tests and providing prediction intervals.
For hypothesis tests and prediction intervals to be valid, the errors must be normally distributed with mean zero, have constant variance across all levels of the predictors, and be independent of one another, and the mean response must be a linear function of the predictors.
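These assumptions are usually checked visually with the built-in diagnostic plots for a fitted `lm` object. A minimal sketch, using the built-in `mtcars` data as a stand-in (not part of this assignment):

```r
# Sketch: standard residual diagnostics for a multiple linear regression.
# mtcars is a stand-in dataset; the idea transfers to any lm fit.
fit <- lm(mpg ~ wt + hp, data = mtcars)
par(mfrow = c(2, 2))
plot(fit)  # residuals vs fitted (linearity, constant variance),
           # QQ plot (normality), scale-location, leverage/influence
```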
Feature selection is intended to select the best subset of predictors. These techniques are designed to explain the data as well as possible, ideally with fewer predictors, since extra predictors add noise, increase cost, and can introduce collinearity. Common techniques include backward elimination, forward selection, and stepwise regression. I think stepwise regression should not be run without an analyst because the process is nuanced: the steps are not linked to the goals of a study, the procedure can easily miss the optimal model, and it does not directly address the question of interest. Stepwise selection tends to pick smaller models and can have the unintended consequence of removing a predictor that is not statistically significant but is still useful to the question of interest. Feature selection methods on their own cannot produce ideal models without an analyst who can take an objective look at the question of interest and the candidate model.
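Automated stepwise selection is a one-liner in base R, which is part of why it is tempting to run it unattended. A sketch using AIC-based `step()` on the stand-in `mtcars` data; as argued above, the returned model still needs an analyst's review:

```r
# Sketch: stepwise selection by AIC with step().  Illustrative only --
# the chosen model should not be accepted without analyst review.
fit.full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)  # stand-in data
fit.step <- step(fit.full, direction = "both", trace = 0)
formula(fit.step)  # the subset step() settled on
```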
If the final model includes third-, fourth-, and fifth-degree polynomial terms, it will be very complex. Adding more features to a model always increases training accuracy (low bias), but this model will have high variance because it is effectively memorizing the training data. I would expect the ASE to be higher on the testing data because the model was fit with more terms than necessary.
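The bias-variance trade-off described above can be demonstrated with simulated data: training ASE always falls as the polynomial degree grows, while test ASE need not. A self-contained sketch (simulated data, arbitrary seed; not the Auto set):

```r
# Sketch: training ASE is non-increasing in polynomial degree (nested
# models), while test ASE can rise as the model starts memorizing noise.
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100)   # truth is linear; higher degrees overfit
train <- 1:70; test <- 71:100
ase <- sapply(1:5, function(d) {
  fit <- lm(y ~ poly(x, d), subset = train)
  c(train = mean(resid(fit)^2),
    test  = mean((y[test] - predict(fit, data.frame(x = x[test])))^2))
})
ase  # row "train" shrinks with degree; row "test" need not
```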
## Exercise 1: EDA
#import some libraries
library(ISLR)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Auto$cylinders <- as.factor(Auto$cylinders)
Auto$origin <- as.factor(Auto$origin)
attach(Auto)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 3: 4 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 4:199 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 5: 3 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 6: 83 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 8:103 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 1:245 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 2: 68 ford pinto : 5
## Median :15.50 Median :76.00 3: 79 toyota corolla : 5
## Mean :15.54 Mean :75.98 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 amc hornet : 4
## Max. :24.80 Max. :82.00 chevrolet chevette: 4
## (Other) :365
t(aggregate(mpg~cylinders,data = Auto,summary))
## [,1] [,2] [,3] [,4] [,5]
## cylinders "3" "4" "5" "6" "8"
## mpg.Min. "18.00000" "18.00000" "20.30000" "15.00000" " 9.00000"
## mpg.1st Qu. "18.75000" "25.00000" "22.85000" "18.00000" "13.00000"
## mpg.Median "20.25000" "28.40000" "25.40000" "19.00000" "14.00000"
## mpg.Mean "20.55000" "29.28392" "27.36667" "19.97349" "14.96311"
## mpg.3rd Qu. "22.05000" "32.95000" "30.90000" "21.00000" "16.00000"
## mpg.Max. "23.70000" "46.60000" "36.40000" "38.00000" "26.60000"
t(aggregate(mpg~cylinders,data = Auto,mean))
## [,1] [,2] [,3] [,4] [,5]
## cylinders "3" "4" "5" "6" "8"
## mpg "20.55000" "29.28392" "27.36667" "19.97349" "14.96311"
t(aggregate(mpg~cylinders,data = Auto,sd))
## [,1] [,2] [,3] [,4] [,5]
## cylinders "3" "4" "5" "6" "8"
## mpg "2.564501" "5.670546" "8.228204" "3.828809" "2.836284"
par(mfrow=c(1,3))
plot(horsepower,mpg,xlab = "horsepower",ylab = "mpg")
new <- data.frame(horsepower=seq(30,300,.1))
lines(seq(30,300,.1),predict(lm(mpg~horsepower),newdata = new),col="red",lwd=4)
plot(cylinders,mpg,xlab="cylinders",ylab="mpg",main="Auto Data Set",col=c(7,32,57,82,107))
plot(weight,mpg)
new2 <- data.frame(weight=seq(1600,5200,1))
lines(seq(1600,5200,1),predict(lm(mpg~weight),newdata = new2),col="red",lwd=4)
pairs(Auto[,-c(2,9)])
pairs(Auto[,-c(2,9)],col=cylinders)
library(car)
## Loading required package: carData
Auto <- Auto[,-9]
full.model <- lm(mpg~.,data=Auto)
vif(full.model)[,3]^2
## cylinders displacement horsepower weight acceleration year
## 1.984867 23.271418 10.559221 11.723047 2.684794 1.323483
## origin
## 1.531211
#Question 5
Multicollinearity appears to be a problem in this dataset, specifically among the displacement, horsepower, and weight variables. Based on its very high VIF, displacement is a good candidate for removal. Thinking about the dataset, a larger engine probably weighs more and has more displacement and horsepower, so it makes sense that these variables are related and have issues with multicollinearity.
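As a self-contained illustration of what `vif()` reports, each predictor's VIF is 1/(1 - R²) from regressing that predictor on the other predictors. A sketch computing this by hand, with `mtcars` standing in for the Auto data:

```r
# Sketch: VIF by hand -- 1 / (1 - R^2) from regressing each predictor on
# the rest.  mtcars stands in for the Auto data; with the car package,
# vif(fit) reports the same values for numeric predictors.
X <- mtcars[, c("disp", "hp", "wt")]
vifs <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
round(vifs, 2)  # all well above 1, flagging collinearity among the three
```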
par(mfrow=c(1,3))
plot(horsepower,mpg,xlab = "horsepower",ylab="mpg")
new <- data.frame(horsepower=seq(30,300,.1))
horse.model <- lm(mpg~horsepower)
lines(seq(30,300,.1),predict(horse.model,newdata = new),col="red",lwd=4)
plot(horse.model$fitted.values,horse.model$residuals,xlab = "Fitted Values",ylab = "Residuals")
plot(horsepower,horse.model$residuals,xlab = "Horsepower",ylab = "Residuals")
par(mfrow=c(1,3))
plot(horsepower,mpg,xlab = "horsepower",ylab = "mpg")
new <- data.frame(horsepower=seq(30,300,.1))
horse.model2 <- lm(mpg~horsepower +I(horsepower^2))
plot(horse.model2$fitted.values,horse.model2$residuals,xlab = "Fitted Values",ylab="Residuals")
plot(horsepower, horse.model2$residuals,xlab = "Horsepower",ylab = "Residuals")
par(mfrow=c(1,3))
plot(horsepower,mpg,xlab="horsepower",ylab = "mpg",col=cylinders)
new <- data.frame(horsepower=seq(30,300,.1))
horse.model2 <- lm(mpg~horsepower+I(horsepower^2))
lines(seq(30,300,.1),predict(horse.model2,newdata = new),col="red",lwd=4)
plot(horse.model2$fitted.values,horse.model2$residuals,xlab = "Fitted Values",ylab = "Residuals")
plot(horsepower,horse.model2$residuals,xlab="Horsepower",ylab = "Residuals")
par(mfrow=c(1,4))
plot(horse.model2)
#Question 6
#par(mfrow=c(1,3))
plot(x=horsepower,y=mpg,xlab = "horsepower",ylab = "mpg",col=cylinders)
new <- data.frame(horsepower=seq(30,300,.1))
horse.model3 <- lm(log(mpg)~horsepower+I(horsepower^2))
lines(seq(30,300,.1),predict(horse.model3,newdata = new),col="red",lwd=4)
plot(horse.model3$fitted.values,horse.model3$residuals,xlab = "Fitted Values",ylab = "Residuals")
plot(horsepower,horse.model3$residuals,xlab="Horsepower",ylab="Residuals")
par(mfrow=c(1,4))
plot(horse.model3)
summary(horse.model3)
##
## Call:
## lm(formula = log(mpg) ~ horsepower + I(horsepower^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66460 -0.12041 0.00316 0.11349 0.66376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.402e+00 7.260e-02 60.639 < 2e-16 ***
## horsepower -1.711e-02 1.255e-03 -13.632 < 2e-16 ***
## I(horsepower^2) 3.901e-05 4.922e-06 7.925 2.44e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1764 on 389 degrees of freedom
## Multiple R-squared: 0.7324, Adjusted R-squared: 0.731
## F-statistic: 532.2 on 2 and 389 DF, p-value: < 2.2e-16
There is no clear pattern in the residuals: they are spread evenly across the plot with no obvious shape, indicating no departures from normality. However, the transformation does not seem to fully correct the constant variance issue. I think including the quadratic term is still beneficial to this model; based on the summary statistics, the model is significant as a whole, as is each predictor.
#Question 7, Part 2
We must keep in mind that we deleted the observations with three- or five-cylinder engines purely for our own convenience. I think this will make our conclusions a bit noisier because we have selectively removed data, and we have also discarded any interesting trends those data points might have revealed.
library(leaps)
#Part 3
Auto <- Auto[-120,]
set.seed(1234)
index<-sample(1:dim(Auto)[1],192,replace=F)
train<-Auto[index,]
test<-Auto[-index,]
reg.fwd=regsubsets(log(mpg)~displacement+weight+acceleration+year+origin,data=train,method="forward",nvmax=6)
predict.regsubsets <- function(object, newdata, id, ...){
  form  <- as.formula(object$call[[2]])  # extract the model formula
  mat   <- model.matrix(form, newdata)   # build the test design matrix
  coefi <- coef(object, id = id)         # coefficients of the id-th model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                 # predictions
}
testASE<-c()
#the loop runs to 6, the maximum model size from regsubsets (origin contributes two dummy columns)
for (i in 1:6){
predictions<-predict.regsubsets(object=reg.fwd,newdata=test,id=i)
testASE[i]<-mean((log(test$mpg)-predictions)^2)
}
par(mfrow=c(1,1))
plot(1:6,testASE,type="l",xlab="# of predictors",ylab="test vs train ASE",ylim=c(0.01,0.05))
index<-which(testASE==min(testASE))
points(index,testASE[index],col="red",pch=10)
rss<-summary(reg.fwd)$rss
lines(1:6,rss/192,lty=3,col="blue") #training ASE = RSS / training sample size (192)
reg.final=regsubsets(log(mpg)~displacement+weight+acceleration+year+origin,data=Auto,method="forward",nvmax=6)
coef(reg.final,3)
## (Intercept) weight year origin2
## 1.5633405896 -0.0003014851 0.0319115667 0.0483012935
final.model<-lm(log(mpg)~weight+year+origin,data=Auto)
summary(final.model)
##
## Call:
## lm(formula = log(mpg) ~ weight + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42350 -0.07213 0.01214 0.07139 0.38223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.508e+00 1.448e-01 10.413 < 2e-16 ***
## weight -2.861e-04 9.369e-06 -30.542 < 2e-16 ***
## year 3.183e-02 1.754e-03 18.145 < 2e-16 ***
## origin2 7.259e-02 1.877e-02 3.868 0.000129 ***
## origin3 5.790e-02 1.870e-02 3.096 0.002101 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1203 on 386 degrees of freedom
## Multiple R-squared: 0.8765, Adjusted R-squared: 0.8752
## F-statistic: 684.7 on 4 and 386 DF, p-value: < 2.2e-16
plot(exp(final.model$fitted.values),Auto$mpg,xlab="Predicted mpg",ylab="Actual mpg")
lines(c(0,50),c(0,50),col="red")
head(predict(final.model,Auto,interval="predict"))
## fit lwr upr
## 1 2.733423 2.495769 2.971076
## 2 2.679342 2.441705 2.916978
## 3 2.752881 2.515208 2.990553
## 4 2.753739 2.516066 2.991412
## 5 2.749161 2.511493 2.986829
## 6 2.493921 2.255956 2.731886
Based on the above model, we need to use weight, year, and origin as predictors.
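One caveat on the intervals printed above: because the response is log(mpg), the fit and bounds are on the log scale, and exponentiating them yields intervals in mpg. A self-contained sketch of the back-transform (simulated data, not the Auto set):

```r
# Sketch: back-transforming a prediction interval from a log-scale model.
# Simulated data; the same exp() step applies to the log(mpg) intervals.
set.seed(42)
w <- runif(50, 1, 5)
m <- exp(3 - 0.4 * w + rnorm(50, sd = 0.1))  # response positive by design
fit <- lm(log(m) ~ w)
pi.log <- predict(fit, data.frame(w = 3), interval = "predict")
exp(pi.log)  # fit, lwr, upr now on the original scale of m
```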
#Part 4
summary(final.model)
##
## Call:
## lm(formula = log(mpg) ~ weight + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42350 -0.07213 0.01214 0.07139 0.38223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.508e+00 1.448e-01 10.413 < 2e-16 ***
## weight -2.861e-04 9.369e-06 -30.542 < 2e-16 ***
## year 3.183e-02 1.754e-03 18.145 < 2e-16 ***
## origin2 7.259e-02 1.877e-02 3.868 0.000129 ***
## origin3 5.790e-02 1.870e-02 3.096 0.002101 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1203 on 386 degrees of freedom
## Multiple R-squared: 0.8765, Adjusted R-squared: 0.8752
## F-statistic: 684.7 on 4 and 386 DF, p-value: < 2.2e-16
plot(final.model)
Based on the diagnostic plots above, I think this model meets the regression assumptions and is suitable for analysis.